Statistical learning: non-parametric methods

MACS 30100
University of Chicago

March 6, 2017

Parametric methods

  1. First make an assumption about the functional form of \(f\)
  2. After a model has been selected, fit or train the model using the actual data

OLS

Parametric methods

\[Y = \beta_0 + \beta_{1}X_1\]

  • \(Y =\) sales
  • \(X_{1} =\) advertising spending in a given medium
  • \(\beta_0 =\) intercept
  • \(\beta_1 =\) slope

Non-parametric methods

  • No/minimal assumptions about functional form
  • Use data to estimate \(f\) directly
    • Get close to data points
    • Avoid overcomplexity
  • Requires a large number of observations

Two types of non-parametric

  1. Techniques that do not rely on data belonging to any particular distribution
  2. Techniques that do not assume a global structure

Non-parametric methods for description

  • Data description

Measures of central tendency

  • Median
  • Mode
  • Arithmetic mean

    \[\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i\]
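The three measures can be computed directly; a quick sketch in NumPy (the sample values are made up for illustration):

```python
import numpy as np
from collections import Counter

x = np.array([1, 2, 2, 3, 4, 7, 9])

mean = x.sum() / len(x)                          # (1/n) * sum of x_i
median = np.median(x)                            # middle value of the sorted sample
mode = Counter(x.tolist()).most_common(1)[0][0]  # most frequent value

print(mean, median, mode)  # → 4.0 3.0 2
```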

Measures of dispersion

  • Variance

    \[E[X] = \mu\]

    \[\text{Var}(X) \equiv \sigma^2 = E[X^2] - (E[X])^2\]

  • Deviation
  • Standard deviation

    \[\sigma = \sqrt{E[X^2] - (E[X])^2}\]

  • Median absolute deviation

    \[MAD = \text{median}(|X_i - \text{median}(X)|)\]
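The dispersion formulas translate directly into code; a sketch using NumPy on the same kind of toy sample (sample values invented for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 7.0, 9.0])

# Var(X) = E[X^2] - (E[X])^2  (population form, matching the formula above)
var = np.mean(x**2) - np.mean(x)**2
sd = np.sqrt(var)

# MAD = median(|X_i - median(X)|)
mad = np.median(np.abs(x - np.median(x)))

print(var, sd, mad)
```

Note that the MAD is far less sensitive to the two large values (7, 9) than the standard deviation is, which is why it is the preferred non-parametric measure of spread.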

Histograms

Density estimation

  • Nonparametric density estimation

    \[x_0 + 2(j - 1)h \leq X_i < x_0 + 2jh\]

    \[\hat{p}(x) = \frac{\#_{i = 1}^n [x_0 + 2(j - 1)h \leq X_i < x_0 + 2jh]}{2nh}\]

    \[\hat{p}(x) = \frac{1}{nh} \sum_{i = 1}^n W \left( \frac{x - X_i}{h} \right)\]

    \[W(z) = \begin{cases} \frac{1}{2} & \text{for } |z| < 1 \\ 0 & \text{otherwise} \\ \end{cases}\]

    \[z = \frac{x - X_i}{h}\]
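The estimator above can be sketched directly from the weight function \(W\) (toy data and NumPy assumed; `naive_density` is an illustrative name, not a library function):

```python
import numpy as np

def W(z):
    # rectangular weight: 1/2 for |z| < 1, 0 otherwise
    return np.where(np.abs(z) < 1, 0.5, 0.0)

def naive_density(x, data, h):
    # p_hat(x) = (1/nh) * sum_i W((x - X_i) / h)
    z = (x - data) / h
    return W(z).sum() / (len(data) * h)

data = np.array([0.0, 0.5, 1.0, 2.0])
print(naive_density(0.5, data, h=1.0))  # → 0.375 (3 of 4 points fall in the window)
```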

Naive density estimation

Density estimation

  • Kernels

    \[\hat{p}(x) = \frac{1}{nh} \sum_{i = 1}^n K \left( \frac{x - X_i}{h} \right)\]

Gaussian kernel

\[K(z) = \frac{1}{\sqrt{2 \pi}}e^{-\frac{1}{2} z^2}\]

Rectangular (uniform) kernel

\[K(z) = \frac{1}{2} \mathbf{1}_{\{ |z| \leq 1 \} }\]

Triangular kernel

\[K(z) = (1 - |z|) \mathbf{1}_{\{ |z| \leq 1 \} }\]

Quartic (biweight) kernel

\[K(z) = \frac{15}{16} (1 - z^2)^2 \mathbf{1}_{\{ |z| \leq 1 \} }\]

Epanechnikov kernel

\[K(z) = \frac{3}{4} (1 - z^2) \mathbf{1}_{\{ |z| \leq 1 \} }\]
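The five kernels above can be written as one-liners and plugged into the kernel estimator \(\hat{p}(x)\); a sketch in NumPy (the numeric check of the integrals is for illustration):

```python
import numpy as np

# The five kernels above, each a function of the standardized argument z
kernels = {
    "gaussian":     lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi),
    "rectangular":  lambda z: 0.5 * (np.abs(z) <= 1),
    "triangular":   lambda z: np.maximum(1 - np.abs(z), 0),
    "quartic":      lambda z: (15 / 16) * (1 - z**2)**2 * (np.abs(z) <= 1),
    "epanechnikov": lambda z: (3 / 4) * (1 - z**2) * (np.abs(z) <= 1),
}

def kde(x, data, h, K):
    # p_hat(x) = (1/nh) * sum_i K((x - X_i) / h)
    return K((x - data) / h).sum() / (len(data) * h)

# Each kernel integrates to 1, so the resulting estimate is a proper density
z = np.linspace(-4, 4, 80001)
dz = z[1] - z[0]
integrals = {name: float(K(z).sum() * dz) for name, K in kernels.items()}

est = kde(0.0, np.array([-1.0, 0.0, 1.0]), 1.0, kernels["gaussian"])
print(integrals, est)
```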

Comparison of kernels

Selecting the bandwidth \(h\)

\[h = 0.9 A n^{-1 / 5}\]

\[A = \min \left( S, \frac{IQR}{1.349} \right)\]

where \(S\) is the sample standard deviation.
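This rule of thumb is straightforward to compute; a sketch in NumPy (`silverman_bandwidth` is an illustrative name, and the simulated standard-normal sample is just for demonstration):

```python
import numpy as np

def silverman_bandwidth(x):
    # A = min(sample std dev, IQR / 1.349); h = 0.9 * A * n^(-1/5)
    s = x.std(ddof=1)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    a = min(s, iqr / 1.349)
    return 0.9 * a * len(x) ** (-1 / 5)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
h = silverman_bandwidth(x)
print(h)  # roughly 0.9 * 1 * 1000^(-1/5) ≈ 0.23 for standard-normal data
```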

Naive non-parametric regression

\[\mu = E(\text{Income}|\text{Education}) = f(\text{Education})\]

\[\mu = E(Y|x) = f(x)\]

  • Binning

\[X_1 \in \{1, 2, \dots ,10 \}\] \[X_2 \in \{1, 2, \dots ,10 \}\] \[X_3 \in \{1, 2, \dots ,10 \}\]

  • \(10^3 = 1000\) possible combinations of the explanatory variables and \(1000\) conditional expectations of \(Y\) given \(X\):

\[\mu = E(Y|x_1, x_2, x_3) = f(x_1, x_2, x_3)\]

\(K\)-nearest neighbors regression

\[\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in N_0} y_i\]
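The estimator is simply an average of the responses for the \(K\) observations closest to \(x_0\); a minimal sketch in NumPy with made-up data:

```python
import numpy as np

def knn_regress(x0, X, y, K):
    # N_0: indices of the K observations closest to x0
    idx = np.argsort(np.abs(X - x0))[:K]
    # f_hat(x0) = (1/K) * sum of y_i over the neighborhood
    return y[idx].mean()

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
print(knn_regress(2.5, X, y, K=2))  # neighbors at x = 2 and x = 3 → 2.5
```

Small \(K\) yields a flexible, jagged fit; large \(K\) smooths toward the global mean.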

\[y_i = 2 + x_i + \epsilon_i\]

\(K\)-nearest neighbors regression

\[y_i = 2 + x_i + x_i^2 + x_i^3 + \epsilon_i\]

Weighted \(K\)-nearest neighbors regression

\[\text{Distance}(x, y) = \left( \sum_{j = 1}^k |x_j - y_j| ^p \right)^\frac{1}{p}\]
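One common weighting scheme is inverse-distance weighting, where closer neighbors count more; a sketch in NumPy using the Minkowski distance above (function names and data are illustrative, and the small constant guarding against division by zero is an assumption of this sketch):

```python
import numpy as np

def minkowski(a, b, p):
    # (sum_j |a_j - b_j|^p)^(1/p) over the k features
    return (np.abs(a - b) ** p).sum() ** (1 / p)

def weighted_knn_regress(x0, X, y, K, p=2):
    d = np.array([minkowski(x0, xi, p) for xi in X])
    idx = np.argsort(d)[:K]          # K nearest neighbors
    w = 1 / (d[idx] + 1e-12)         # inverse-distance weights
    return (w * y[idx]).sum() / w.sum()

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y = np.array([1.0, 2.0, 10.0])
print(weighted_knn_regress(np.array([0.5, 0.5]), X, y, K=2))
```

With \(p = 2\) this is Euclidean distance; \(p = 1\) gives Manhattan distance.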

Estimating KNN on simulated wage data

KNN on Biden

Weighted KNN on Biden

Non-linearity of linear models

Parametric methods

  • Linear regression
  • Logistic regression
  • Generalized linear models (GLMs)
  • Polynomial regression
  • Step functions

Non-parametric methods

  • Regression splines
  • Smoothing splines
  • Local regression
  • Generalized additive models
  • Decision trees
  • Bagging/random forest/boosting
  • Support vector machines

Bayes decision rule

\[\Pr(Y = j | X = x_0)\]

Bayes decision rule

Bayes error rate

\[1 - E \left( \max_j \Pr(Y = j | X) \right)\]

\(K\)-nearest neighbors classification

\[\Pr(Y = j | X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)\]
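The classifier estimates each class probability as the fraction of the \(K\) nearest neighbors in that class, then assigns the most probable class; a minimal sketch in NumPy with invented data:

```python
import numpy as np
from collections import Counter

def knn_classify(x0, X, y, K):
    # N_0: indices of the K observations closest to x0
    idx = np.argsort(np.abs(X - x0))[:K]
    # Pr(Y = j | X = x0) ≈ fraction of neighbors with class j
    counts = Counter(y[idx].tolist())
    probs = {j: c / K for j, c in counts.items()}
    # assign the class with the highest estimated probability
    return max(probs, key=probs.get), probs

X = np.array([1.0, 2.0, 3.0, 6.0, 7.0])
y = np.array([0, 0, 0, 1, 1])
print(knn_classify(2.2, X, y, K=3))  # → (0, {0: 1.0})
```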

Applying KNN to Titanic